One of the biggest issues the world currently faces is the rising number of terrorist attacks. Most people hear about them through the news or the internet, but we rarely look closely at the statistics behind terrorism. In this tutorial we will explore a dataset of terrorist attacks to demonstrate the process of the data science pipeline.
The dataset we will be using is from the Global Terrorism Database, which has data available from 1970-2016 and is also available here. We chose this dataset because it has many attributes describing each entity, where an entity represents a single act of terrorism. From the data we can start to analyze the statistical outcomes in the hope of discovering trends.
I first downloaded the data as a comma-separated values (CSV) file and collected the attributes necessary for the analysis.
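As a point of reference, the load step might look like the self-contained sketch below. It writes a tiny toy CSV to a temp file so it runs on its own; the real download has far more rows and columns, and the column names here only mirror a few of the GTD attributes.

```r
# Self-contained sketch of the load step, using a toy CSV written to a
# temp file; the real analysis reads the downloaded GTD file the same way.
tmp <- tempfile(fileext = ".csv")
writeLines(c("iyear,country_txt,region_txt,success",
             "1970,United States,North America,1",
             "1971,France,Western Europe,0"), tmp)

tidy_terror_sample <- read.csv(tmp, stringsAsFactors = FALSE)
nrow(tidy_terror_sample)   # 2 rows in this toy file
```

In the real pipeline the same read.csv call (or data.table::fread, which is faster on a file of this size) produces the tidy_terror data frame used below.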
Renaming of column attributes:
library(data.table)  # setnames() comes from the data.table package
setnames(tidy_terror, old=c("iyear","country_txt","region_txt", "attacktype1_txt","targtype1_txt","weaptype1_txt"),
new=c("Year", "Country","Region","Attack_Type","Target_Type","Weapon"))
To get an understanding of what the data represents at the broadest scale, let's first display a graph of the number of successful attacks worldwide over the entire range of the dataset, 1970-2016. The intuition is that we may find a trend or an interesting observation that could lead to further exploration of the data. To do this we can use the ggplot2 package to visualize the data as a plot of points connected by line segments.
tidy_terror <-na.omit(tidy_terror)
plot <- tidy_terror %>%
group_by(Year) %>%
summarize(success = sum(success)) %>%
ggplot(mapping=aes(y = success, x = Year)) +
geom_line(color = "blue") +
geom_smooth(method =lm)+
geom_point(size=1.75, colour="#CC0000") +
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))+
labs(title="Number of Successful Terrorist Attacks Worldwide (1970-2016)",
y = "# of Successful Attacks",
x = "Year")+
scale_x_continuous(breaks=seq(1970,2016,by=5))+
scale_y_continuous(breaks=seq(0,15000,by=5000))
plot
From the plot above it's very clear that the number of successful attacks has trended upward over the last 46 years. The rise over the last four decades could stem from a number of theories. The first is climate change, which creates a number of scenarios in which terrorists can use scarce resources to gain control over people; this is exemplified in areas plagued by famine and drought, where water supplies can be controlled by a terrorist organization. The world has also seen a rise in radical terrorist groups starting in the early 2000s, which could explain why the graph experiences a huge spike. Climate change & terrorism link For information regarding the ggplot2 package, check out this link!
Now that we have a general idea of the trend, or more appropriately the rise, in terrorism over the last 46 years, let's try to better visualize the data by displaying the # of successful attacks by region, to see if any particular regions stick out.
plot <- tidy_terror %>%
group_by(Year, Region) %>%
summarize(success = sum(success)) %>%
ggplot(mapping=aes(y = success, x = Year, color= Region)) +
geom_point()+
geom_line()+
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))+
labs(title="Number of Successful Terrorist Attacks By Region (1970-2016)",
y = "# of Successful Attacks",
x = "Year")+
scale_x_continuous(breaks=seq(1970,2016,by=5))+
scale_y_continuous(breaks=seq(0,6500,by=500))
plot
From the new graph we can see that the Middle East & North Africa has experienced huge spikes, along with South America and Sub-Saharan Africa. Our original hypothesis that the rise in terrorism is driven by radicalism could prove valid for the spike in the Middle East & North Africa line, but what can we say about South America and Sub-Saharan Africa? Well, for starters, all three regions are generally less developed economically compared to North America and Europe, which have experienced far fewer terrorist attacks over the years.
Since we are more focused on the rise in terrorism, we can eliminate the regions that appear to have stagnant overall growth in successful attacks and instead focus on the top 4 regions: Middle East & North Africa, South Asia, Sub-Saharan Africa and Southeast Asia. We also only care about the recent rise, since most regions displayed stagnant or slow growth before the year 2000, so let's consider the data from 2000-2016. Fitting each region to a linear regression line will give us a better understanding of the trend of the data.
top_terror <- tidy_terror %>%
filter(Region %in% c("Middle East & North Africa","South Asia","Sub-Saharan Africa","Southeast Asia"), Year >= 2000)
plot <- top_terror %>%
group_by(Year, Region) %>%
summarize(success = sum(success)) %>%
ggplot(mapping=aes(y = success, x = Year, color= Region)) +
geom_point()+
geom_line()+
geom_smooth(method=lm)+
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))+
labs(title="Number of Successful Terrorist Attacks for the Top 4 Regions (2000-2016)",
y = "# of Successful Attacks",
x = "Year")+
scale_x_continuous(breaks=seq(2000,2016,by=2))+
scale_y_continuous(breaks=seq(0,6500,by=200))
plot
From the linear regression lines we generated for each of the top 4 regions we can see that although each line has a few outliers, the overall slopes are heading in the positive direction over the past 16 years. We can now start to think about whether or not there is a relationship between successful attacks and the type of attack.
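To put numbers on those slopes, we could fit a separate linear model per region and pull out the Year coefficient. The sketch below uses toy per-region yearly totals in place of the real top_terror summaries, and base R only:

```r
# Toy stand-in for the per-region yearly totals computed above; in the
# real analysis these rows would come from top_terror.
toy <- data.frame(
  Region  = rep(c("A", "B"), each = 5),
  Year    = rep(2000:2004, times = 2),
  success = c(10, 12, 15, 19, 24,  5, 5, 6, 7, 7)
)

# The Year coefficient of each per-region fit is the slope of its trend line.
slopes <- sapply(split(toy, toy$Region),
                 function(d) unname(coef(lm(success ~ Year, data = d))["Year"]))
slopes   # region A is growing much faster than region B
```

A positive slope for every region would confirm numerically what the fitted lines suggest visually.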
Let's look at the relationship between the number of successful attacks over time and the type of attack, separated by region, and see what we can learn. We will look at the Middle East & North Africa, South Asia, Sub-Saharan Africa and Southeast Asia regions for this analysis:
library(gridExtra)
## Warning: package 'gridExtra' was built under R version 3.4.4
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
plot1 <- top_terror %>%
filter(Region == "Middle East & North Africa") %>%
group_by(Year,Attack_Type) %>%
summarize(success = sum(success)) %>%
ggplot(mapping=aes(y = success, x = Year, color=Attack_Type, fill = Attack_Type)) +
geom_bar(stat="identity")+
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))+
labs(title="Distribution of Attack Types for the Middle East & North Africa",
y = "# of Successful Attacks",
x = "Year")+
scale_x_continuous(breaks=seq(2000,2016,by=2))
plot1
plot2 <- top_terror %>%
filter(Region == "South Asia") %>%
group_by(Year,Attack_Type) %>%
summarize(success = sum(success)) %>%
ggplot(mapping=aes(y = success, x = Year, color=Attack_Type, fill = Attack_Type)) +
geom_bar(stat="identity")+
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))+
labs(title="Distribution of Attack Types for South Asia",
y = "# of Successful Attacks",
x = "Year")+
scale_x_continuous(breaks=seq(2000,2016,by=2))
plot2
plot3 <- top_terror %>%
filter(Region == "Sub-Saharan Africa") %>%
group_by(Year,Attack_Type) %>%
summarize(success = sum(success)) %>%
ggplot(mapping=aes(y = success, x = Year, color=Attack_Type, fill = Attack_Type)) +
geom_bar(stat="identity")+
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))+
labs(title="Distribution of Attack Types for Sub-Saharan Africa",
y = "# of Successful Attacks",
x = "Year")+
scale_x_continuous(breaks=seq(2000,2016,by=2))
plot3
plot4 <- top_terror %>%
filter(Region == "Southeast Asia") %>%
group_by(Year,Attack_Type) %>%
summarize(success = sum(success)) %>%
ggplot(mapping=aes(y = success, x = Year, color=Attack_Type, fill = Attack_Type)) +
geom_bar(stat="identity")+
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))+
labs(title="Distribution of Attack Types for Southeast Asia",
y = "# of Successful Attacks",
x = "Year")+
scale_x_continuous(breaks=seq(2000,2016,by=2))
plot4
From all 4 bar plots it's clear that Armed Assault, Assassination and Bombing/Explosion appear, in that order, as the top 3 types of attacks that all 4 regions experience. What's interesting is that each region seems to follow an almost identical pattern: the counts for the top 3 attack types increase at roughly the same rates across each of the 4 regions over the past 16 years.
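The four plots above differ only in the region filter, so they can also be produced in a single call with facet_wrap. This is a sketch rather than a drop-in replacement: it builds a tiny toy top_terror so it runs on its own, but with the real top_terror the pipeline is identical.

```r
library(ggplot2)
library(dplyr)

# Toy stand-in for top_terror so the sketch is self-contained; the real
# call is the same with the full dataset.
top_terror <- data.frame(
  Year        = rep(2000:2001, times = 4),
  Region      = rep(c("South Asia", "Southeast Asia"), each = 4),
  Attack_Type = rep(c("Bombing/Explosion", "Armed Assault"), times = 4),
  success     = c(30, 20, 25, 15, 10, 12, 8, 9)
)

p <- top_terror %>%
  group_by(Year, Region, Attack_Type) %>%
  summarize(success = sum(success)) %>%
  ggplot(aes(x = Year, y = success, fill = Attack_Type)) +
  geom_col() +                      # shorthand for geom_bar(stat = "identity")
  facet_wrap(~ Region) +            # one panel per region
  labs(title = "Distribution of Attack Types by Region",
       y = "# of Successful Attacks", x = "Year")
p
```

Faceting keeps the axes and legend consistent across panels, which makes cross-region comparison easier than four separately scaled plots.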
To get the annual success rate, we group the data by year and tally successes. This requires creating some new attributes in our dataset: we sum the total successes and the total number of incidents, then create a success_rate attribute that divides the total successes by the total number of incidents; both totals are also added as attributes.
success_rate_df <- tidy_terror %>%
group_by(Year) %>%
summarize(success=sum(success), x=n())%>%
mutate(total_successes =success, total_attempts = x,success_rate = total_successes/total_attempts) %>%
select(Year, total_successes, total_attempts, success_rate)
success_rate_df
## # A tibble: 46 x 4
## Year total_successes total_attempts success_rate
## <int> <int> <int> <dbl>
## 1 1970 542 643 0.843
## 2 1971 410 461 0.889
## 3 1972 448 492 0.911
## 4 1973 425 465 0.914
## 5 1974 540 576 0.938
## 6 1975 693 727 0.953
## 7 1976 839 900 0.932
## 8 1977 1167 1292 0.903
## 9 1978 1340 1453 0.922
## 10 1979 2280 2529 0.902
## # ... with 36 more rows
plot <- success_rate_df %>%
ggplot(mapping=aes(y = success_rate, x = Year)) +
geom_line(color = "blue") +
geom_smooth(method =lm)+
geom_point(size=1.75, colour="#CC0000") +
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))+
labs(title="Percentage of Successful Terrorist Attack Attempts by Year (1970-2016)",
y = "% of Successful Attacks",
x = "Year")+
scale_x_continuous(breaks=seq(1970,2016,by=5))+
scale_y_continuous(breaks=seq(0.8,1,by=.02))
plot
The graph does not appear to show any linear relationship in the percentage of successful attacks over time. What is interesting is that although the rate has been decreasing from about 2007 onward, the number of successful attacks per year is still rising. This implies that the total number of attempted attacks must be increasing fast enough to offset the falling success rate and keep the yearly total of successful attacks growing. Performing a linear regression on this specific data would be futile because we see no apparent relationship between the % of successful attacks and time.
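That last point is just arithmetic: since attempts = successes / rate, rising successes combined with a falling rate force attempts to rise even faster. The numbers below are illustrative only, not taken from the dataset:

```r
# Illustrative numbers (not from the dataset): successes keep rising while
# the success rate falls, so implied attempts must rise faster than successes.
successes <- c(5000, 8000, 12000)   # hypothetical annual successful attacks
rate      <- c(0.92, 0.88, 0.84)    # hypothetical falling success rate

attempts <- successes / rate
round(attempts)                      # grows faster year-over-year than successes
```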
While graphical presentations of the data are an effective way to relay information about the dataset, we can also take advantage of the leaflet package. This package allows us to represent data dynamically on a map. Using the latitude and longitude points from the dataset, we can better visualize which parts of the world in particular are experiencing the highest levels of terrorism. We will accomplish this using the addWebGLHeatmap function, which adds a heatmap overlay onto our map, using the lat and long points to plot each individual act of terrorism.
library(leaflet)
## Warning: package 'leaflet' was built under R version 3.4.4
library(maps)
## Warning: package 'maps' was built under R version 3.4.4
##
## Attaching package: 'maps'
## The following object is masked from 'package:purrr':
##
## map
library(leaflet.extras)
## Warning: package 'leaflet.extras' was built under R version 3.4.4
worldmap <- leaflet(tidy_terror) %>%
addTiles() %>%
addWebGLHeatmap(lng=~longitude, lat=~latitude, size =100000) %>%
setView(lat=39.29, lng=-76.61, zoom=1.5)
worldmap
The map gives us a geographic view of where attacks are concentrated, and the hotspots line up with the regions we identified earlier.
Can we predict, from how a region's success rate has shifted year to year, whether an attack will succeed? We can address this question by first filtering to the years of interest and standardizing the successful-attack percentage for each region. To train our model we need a table with one row per region. We will do this in stages: first we turn the tidy dataset into a wide dataset using tidyr::spread, then create a data frame containing the year-to-year differences we will use as features. Finally, we add the outcome we want to predict to the data frame we created previously. We then split the dataset into training regions and testing regions with a random split, make predictions on the held-out set, and build a confusion matrix to calculate the error rate.
success_rate_df1 <- tidy_terror %>%
group_by(Region,Year) %>%
summarize(success=sum(success), x=n())%>%
mutate(total_successes =success, total_attempts = x,success_rate = total_successes/total_attempts) %>%
select(Year,Region, total_successes, total_attempts, success_rate)
success_rate_df1
## # A tibble: 513 x 5
## # Groups: Region [12]
## Year Region total_successes total_attempts success_rate
## <int> <chr> <int> <int> <dbl>
## 1 1970 Australasia & Oceania 1 1 1.00
## 2 1971 Australasia & Oceania 1 1 1.00
## 3 1972 Australasia & Oceania 2 2 1.00
## 4 1973 Australasia & Oceania 1 1 1.00
## 5 1974 Australasia & Oceania 1 1 1.00
## 6 1978 Australasia & Oceania 2 2 1.00
## 7 1979 Australasia & Oceania 2 2 1.00
## 8 1980 Australasia & Oceania 6 7 0.857
## 9 1981 Australasia & Oceania 3 3 1.00
## 10 1982 Australasia & Oceania 2 2 1.00
## # ... with 503 more rows
standardized_df1 <- success_rate_df1 %>%
filter(Year >=2000) %>%
group_by(Region ) %>%
mutate(mean_success = mean(success_rate)) %>%
mutate(sd_success = sd(success_rate)) %>%
mutate(zScore_success = (success_rate - mean_success) / sd_success) %>%
ungroup()
wide_df1 <- standardized_df1 %>%
select(Region, Year, zScore_success) %>%
tidyr::spread(Year, zScore_success)
#wide_df
matrix_1a <- wide_df1 %>%
select(-Region) %>%
as.matrix() %>%
.[,-1]
matrix_2a <- wide_df1 %>%
select(-Region) %>%
as.matrix() %>%
.[,-ncol(.)]
diff_df1a <- (matrix_1a - matrix_2a) %>%
magrittr::set_colnames(NULL) %>%
as_data_frame() %>%
mutate(Region = wide_df1$Region)
final_df1a <- diff_df1a %>%
inner_join(tidy_terror %>% select(Region, success), by="Region") %>%
mutate(success=factor(success, levels=c(1, 0)))
#final_df
set.seed(1234)
test_random_forest_df1a <- final_df1a %>%
group_by(success) %>%
sample_frac(.1) %>%
ungroup()
test_random_forest_df1a <-na.omit(test_random_forest_df1a)
train_random_forest_df1a <- final_df1a %>%
anti_join(test_random_forest_df1a, by="Region")
train_random_forest_df1a <-na.omit(train_random_forest_df1a)
train_random_forest_df1a
## # A tibble: 0 x 18
## # ... with 18 variables: V1 <dbl>, V2 <dbl>, V3 <dbl>, V4 <dbl>, V5 <dbl>,
## # V6 <dbl>, V7 <dbl>, V8 <dbl>, V9 <dbl>, V10 <dbl>, V11 <dbl>,
## # V12 <dbl>, V13 <dbl>, V14 <dbl>, V15 <dbl>, V16 <dbl>, Region <chr>,
## # success <fct>
library(randomForest)
## Warning: package 'randomForest' was built under R version 3.4.4
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:gridExtra':
##
## combine
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
rf1a <- randomForest(success~., data=test_random_forest_df1a %>% select(-Region))
rf1a
##
## Call:
## randomForest(formula = success ~ ., data = test_random_forest_df1a %>% select(-Region))
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 4
##
## OOB estimate of error rate: 10.86%
## Confusion matrix:
## 1 0 class.error
## 1 13892 0 0
## 0 1692 0 1
test_predictions1a <- predict(rf1a, newdata=train_random_forest_df1a %>% select(-Region))
table(pred=test_predictions1a, observed=train_random_forest_df1a$success)
## observed
## pred 1 0
## 1 0 0
## 0 0 0
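The error rate the random forest reports is just the off-diagonal share of its confusion matrix. We can recompute it by hand from the OOB counts printed above:

```r
# Rebuild the OOB confusion matrix printed above (rows = observed class,
# columns = predicted class) and recompute the error rate by hand.
conf <- matrix(c(13892,    0,
                  1692,    0),
               nrow = 2, byrow = TRUE,
               dimnames = list(observed  = c("1", "0"),
                               predicted = c("1", "0")))

# error rate = 1 - (correct predictions on the diagonal / all predictions)
error_rate <- 1 - sum(diag(conf)) / sum(conf)
round(100 * error_rate, 2)   # 10.86, matching the reported OOB estimate
```

Note that the matrix shows the forest predicting class 1 for every observation, a symptom of the heavy class imbalance (successful attacks vastly outnumber failures), so the 10.86% error is simply the share of failed attacks in the data.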
Let's look at the t-statistic of a linear fit of success rate against year to see if there is a linear dependence. We set our null hypothesis as: there is no relationship between success rate and time. From the statistics below we would not reject the null hypothesis, because the p-value for the Year coefficient is 0.288, far too large to be significant at any conventional threshold, so we fail to reject the null. This matches our intuition from the graph above, which also exhibited no apparent linear relationship.
set.seed(1234)
# create a linear regression model for the data
library(broom)  # for tidy()
regression_fit = lm(success_rate~Year, data = standardized_df1)
regression_fit %>%
tidy()
## term estimate std.error statistic p.value
## 1 (Intercept) 5.384201734 4.246757723 1.267838 0.2063598
## 2 Year -0.002253677 0.002114907 -1.065615 0.2879086
For more information about linear regression, visit this link!
This tutorial displayed some of the many features we can utilize in R to represent the full data science pipeline. Many different approaches can be taken to handling, displaying and analyzing the data, which makes R a great language for data science. Overall, the terrorist attack dataset provided an opportunity to learn some interesting information using a few simple techniques that data science employs all the time!